67 research outputs found

    Large margin methods for partner specific prediction of interfaces in protein complexes

    2014 Spring. The study of protein interfaces and binding sites is a very important domain of research in bioinformatics. Information about the interfaces between proteins can be used not only in understanding protein function but can also be directly employed in drug design and protein engineering. However, the experimental determination of protein interfaces is cumbersome, expensive and not possible in some cases with today's technology. As a consequence, the computational prediction of protein interfaces from sequence and structure has emerged as a very active research area. A number of machine learning based techniques have been proposed for the solution to this problem. However, the prediction accuracy of most such schemes is very low. In this dissertation we present large-margin classification approaches that have been designed to directly model different aspects of protein complex formation as well as the characteristics of available data. Most existing machine learning techniques for this task are partner-independent in nature, i.e., they ignore the fact that the propensity of a protein to bind to another protein depends on the characteristics of residues in both proteins. We have developed a pairwise support vector machine classifier called PAIRpred to predict protein interfaces in a partner-specific fashion. Due to its more detailed model of the problem, PAIRpred offers state-of-the-art accuracy in predicting both binding sites at the protein level and inter-protein residue contacts at the complex level. PAIRpred uses sequence and structure conservation, local structural similarity and surface geometry, residue solvent exposure and template based features derived from the unbound structures of proteins forming a protein complex.
We have investigated the impact of explicitly modeling the inter-dependencies between residues that are imposed by the overall structure of a protein during the formation of a protein complex through transductive and semi-supervised learning models. We also present a novel multiple instance learning scheme called MI-1 that explicitly models imprecision in sequence-level annotations of binding sites in proteins that bind calmodulin to achieve state-of-the-art prediction accuracy for this task.

    Deep and self-taught learning for protein accessible surface area prediction

    Accessible surface area (ASA) captures the degree of burial or surface accessibility of a protein residue. It is also a very important indicator of the behavior of amino acids within a protein. It can be used to find protein interactions, interfaces, folding states, etc. Calculation of the ASA requires the structure of the protein. However, structure determination for proteins is expensive and requires significant technical effort. As a consequence, the prediction of ASA is a very important and fundamental problem in bioinformatics and proteomics. In this work, we have investigated self-taught machine learning methods along with deep neural networks to predict the residue-level ASA of a protein. We have found that deep neural networks can predict the ASA of the residues in a protein accurately. Furthermore, the proposed deep learning based method does not require the use of computationally demanding features such as the position specific scoring matrix (PSSM), which have been used in previous works. A simple BLOSUM62 matrix based position dependent representation of amino acids in a sequence window gives comparable performance. This is particularly attractive for proteome-wide prediction of ASA. We have used various self-taught learning schemes for obtaining an optimal feature representation from unlabeled data. These include a sparse and regularized autoencoder neural network and a dictionary based learning scheme. We have used unlabeled data from the protein universe in an attempt to improve the feature representation. We have also evaluated the performance of a stochastic gradient based predictor of accessible surface area for different feature representations.
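
The position dependent sequence-window representation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the window width and the example sequence are hypothetical, and a one-hot table stands in for BLOSUM62 rows so that the example stays self-contained. Substituting a table that maps each residue to its BLOSUM62 row would presumably give the kind of representation the abstract describes.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_table():
    """Stand-in residue table (assumption): one-hot vectors instead of BLOSUM62 rows."""
    size = len(AMINO_ACIDS)
    return {
        aa: [1.0 if i == j else 0.0 for j in range(size)]
        for i, aa in enumerate(AMINO_ACIDS)
    }


def window_features(sequence, center, half_width, table):
    """Concatenate per-residue vectors for a window centered on one residue.

    Positions that fall outside the sequence, and residues missing from
    the table, are encoded as all-zero vectors.
    """
    dim = len(next(iter(table.values())))
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(sequence):
            features.extend(table.get(sequence[pos], [0.0] * dim))
        else:
            features.extend([0.0] * dim)
    return features


# Hypothetical 5-residue sequence; window of 5 positions around residue 0.
table = one_hot_table()
vec = window_features("MKTAY", center=0, half_width=2, table=table)
```

Each residue then has a fixed-length feature vector (window size times table dimension) that a per-residue regressor could take as input.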

    Issues in performance evaluation for host–pathogen protein interaction prediction

    The study of interactions between host and pathogen proteins is important for understanding the underlying mechanisms of infectious diseases and for developing novel therapeutic solutions. Wet-lab techniques for detecting protein–protein interactions (PPIs) can benefit from computational predictions. Machine learning is one of the computational approaches that can assist biologists by predicting promising PPIs. A number of machine learning based methods for predicting host–pathogen interactions (HPI) have been proposed in the literature. The techniques used for assessing the accuracy of such predictors are of critical importance in this domain. In this paper, we question the effectiveness of K-fold cross-validation for estimating the generalization ability of HPI prediction for proteins with no known interactions. K-fold cross-validation does not model this scenario, and we demonstrate a sizable difference between the performance it reports and the performance reported by an alternative evaluation scheme called leave one pathogen protein out (LOPO) cross-validation. LOPO is more effective in modeling the real world use of HPI predictors, specifically for cases in which no information about the interacting partners of a pathogen protein is available during training. We also point out that currently used metrics such as areas under the precision-recall or receiver operating characteristic curves are not intuitive to biologists and propose simpler and more directly interpretable metrics for this purpose.
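
The idea behind LOPO cross-validation can be sketched as a grouped split in which all examples involving one pathogen protein are held out together. This is only an illustration of the splitting scheme as the abstract describes it; the function name, the pair layout and the protein identifiers are hypothetical.

```python
from collections import defaultdict


def lopo_splits(pairs):
    """Yield (held_out_pathogen, train_idx, test_idx) for each pathogen protein.

    `pairs` is a list of (host_protein, pathogen_protein) examples. Every
    example involving the held-out pathogen protein goes to the test fold,
    so the training fold contains no information about its partners.
    """
    by_pathogen = defaultdict(list)
    for idx, (_host, pathogen) in enumerate(pairs):
        by_pathogen[pathogen].append(idx)
    for pathogen, test_idx in by_pathogen.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(pairs)) if i not in held_out]
        yield pathogen, train_idx, test_idx


# Hypothetical host/pathogen identifiers for illustration only.
pairs = [("H1", "P1"), ("H2", "P1"), ("H1", "P2"), ("H3", "P3"), ("H2", "P3")]
splits = list(lopo_splits(pairs))
```

In ordinary K-fold cross-validation, by contrast, examples for the same pathogen protein can land in both the training and test folds, which is the scenario mismatch the abstract points out.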

    AMP0 : species-specific prediction of anti-microbial peptides using zero and few shot learning

    Evolution of drug-resistant microbial species is one of the major challenges to global health. Development of new antimicrobial treatments such as antimicrobial peptides needs to be accelerated to combat this threat. However, the discovery of novel antimicrobial peptides is hampered by low-throughput biochemical assays. Computational techniques can be used for rapid screening of promising antimicrobial peptide candidates prior to testing in the wet lab. The vast majority of existing antimicrobial peptide predictors are non-targeted in nature, i.e., they can predict whether a given peptide sequence is antimicrobial, but they are unable to predict whether the sequence can target a particular microbial species. In this work, we have used zero- and few-shot machine learning to develop a targeted antimicrobial peptide activity predictor called AMP0. The proposed predictor takes the sequence of a peptide and any N/C-termini modifications together with the genomic sequence of a microbial species to generate targeted predictions. Cross-validation results show that the proposed scheme is particularly effective for targeted antimicrobial prediction in comparison to existing approaches and can be used for screening potential antimicrobial peptides in a targeted manner with only a small number of training examples for novel species. The AMP0 webserver is available at http://ampzero.pythonanywhere.com

    Protein binding affinity prediction using support vector regression and interfacial features

    In understanding biology at the molecular level, the analysis of protein interactions and protein binding affinity is a challenge and an important problem in computational and structural biology. Experimental measurement of binding affinity in the wet lab is expensive and time consuming. Therefore, machine learning approaches are widely used to predict protein interactions and binding affinities by learning from specific properties of existing complexes. In this work, we propose an innovative computational model to predict binding affinities and interactions based on sequence, structural and interface features of the interacting proteins that are robust to binding-associated conformational changes. We modeled the prediction of binding affinity as both a classification and a regression problem, with least-squares and support vector regression models using structure and sequence features of proteins. Specifically, we have used the number and composition of interacting residues at the interface of a protein complex, together with sequence features. We evaluated the performance of our prediction models using Affinity Benchmark Dataset version 2.0, which contains a diverse set of both bound and unbound protein complex structures with known binding affinities. We evaluated regression performance with root mean square error (RMSE) as well as Spearman's and Pearson's correlation coefficients using a leave-one-out cross-validation protocol. We evaluated classification results with AUC-ROC and AUC-PR. Our results show that support vector regression performs significantly better than other models, with a Spearman correlation coefficient of 0.58, a Pearson correlation coefficient of 0.55 and an RMSE of 2.41 using 3-mer and sequence features. It is interesting to note that simple features based on 3-mers and the properties of the interface of a protein complex are predictive of its binding affinity.
These features, together with support vector regression, achieve higher accuracy than existing sequence based methods.
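
The three regression metrics named in this abstract (RMSE, Pearson's and Spearman's correlation) can be computed as in the following minimal, dependency-free sketch. It uses the standard textbook definitions, with Spearman's coefficient taken as Pearson's correlation of average ranks. The affinity values are made up for illustration and are not results from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical true and predicted binding affinities (illustrative values only).
y_true = [-10.0, -8.0, -12.0, -9.0]
y_pred = [-9.0, -8.5, -11.0, -9.5]
scores = (rmse(y_true, y_pred), pearson(y_true, y_pred), spearman(y_true, y_pred))
```

Under a leave-one-out protocol, `y_pred` would hold the prediction made for each complex by a model trained on all the other complexes, and the metrics would be computed once over the pooled predictions.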

    AMAP : Hierarchical multi-label prediction of biologically active and antimicrobial peptides

    Due to the increase in antibiotic resistance in recent years, the development of efficient and accurate techniques for the discovery and design of biologically active peptides such as antimicrobial peptides (AMPs) has become essential. The screening of natural and synthetic AMPs in the wet lab is a challenge due to the time and cost involved in such experiments. Bioinformatics methods can be used to speed up the discovery and design of antimicrobial peptides by limiting the wet-lab search to promising peptide sequences. However, most such tools are limited to predicting whether or not a peptide exhibits antimicrobial activity, and they do not identify the exact type of biological activity of these peptides. In this work, we have designed a machine learning based model called AMAP for predicting the biological activity of peptides with a specialized focus on antimicrobial activity prediction. AMAP uses multi-label classification to predict 14 different types of biological functions of a given peptide sequence with improved accuracy in comparison to existing state-of-the-art techniques. We have performed stringent performance analyses of the proposed method. In addition to cross-validation and performance comparison with existing AMP predictors, AMAP has also been benchmarked on recently published experimentally verified peptides that were not a part of our training set. We have also analyzed the features used in this work, and our analysis shows that the proposed predictor can generalize well in predicting the biological activity of novel peptide sequences. A webserver of the proposed method is available at the URL: http://faculty.pieas.edu.pk/fayyaz/software.html#AMA

    CAFÉ-Map : context aware feature mapping for mining high dimensional biomedical data

    Feature selection and ranking are of great importance in the analysis of biomedical data. In addition to reducing the number of features used in classification or other machine learning tasks, they allow us to extract meaningful biological and medical information from a machine learning model. Most existing approaches in this domain do not directly model the fact that the relative importance of features can be different in different regions of the feature space. In this work, we present a context aware feature ranking algorithm called CAFÉ-Map. CAFÉ-Map is a locally linear feature ranking framework that allows recognition of important features in any given region of the feature space or for any individual example. This allows for simultaneous classification and feature ranking in an interpretable manner. We have benchmarked CAFÉ-Map on a number of toy and real world biomedical data sets. Our comparative study with a number of published methods shows that CAFÉ-Map achieves better accuracies on these data sets. The top ranking features obtained through CAFÉ-Map in a gene profiling study correlate very well with the importance of different genes reported in the literature. Furthermore, CAFÉ-Map provides a more in-depth analysis of feature ranking at the level of individual examples.